Abstract

There has been a lot of interest in developing algorithms to extract
clusters or communities from networks. This work proposes a method,
based on blockmodelling, for leveraging communities and other topological
features for use in a predictive classification task. Motivated by the issues
faced by the field of community detection and inspired by recent advances
in Bayesian topic modelling, the presented model automatically discovers
topological features relevant to a given classification task. In this way,
rather than attempting to identify some universal best set of clusters for
an undefined goal, the aim is to find the best set of clusters for a particular
purpose. Using this method, topological features can be validated and
assessed within a given context by their predictive performance.

The proposed model differs from other relational and semi-supervised
learning models as it identifies topological features to explain the
classification decision. In a demonstration on a number of real networks the
predictive capability of the topological features is shown to rival the
performance of content based relational learners. Additionally, the model is
shown to outperform graph-based semi-supervised methods on directed
and approximately bipartite networks.

Keywords: Social Networks, Blockmodelling, Node Classification. 



1 Introduction 

Networks are found everywhere, spanning a range of domains including physics, 
sociology, biology, and computer science. Where traditional knowledge dis- 
covery methods have focused on data attribute values for learning tasks, the 
interrelations between data instances have now become an important area of 
research. This work considers the automatic discovery of topological features 
which are relevant to a predictive classification task. The work is motivated by
some of the issues facing the field of community detection and inspired by
recent advances in Bayesian topic modelling [2].

Figure 1: Examples of the topological features (network positions) which are
used for classification. The colours indicate nodes of the same network position.
Network positions are defined by their connections to each of the network
positions. Positions may be assortative (top left), tending to link to others of the
same position, or disassortative (bottom right), linking to positions different to
themselves.

An example application domain which has benefited from the use of network 
structures is fraud detection [SHi]. Typical structural indicators of fraud include:
communities of interest [3] (fraudulent entities are closer to other fraudsters),
structural equivalence (aliases of a fraudulent entity link to the same set
of entities), and bipartite networks [5] (fraudsters generating normal activity 
using "accomplice" entities). It is the aim of this work to identify these types of 
topological features to predict the unknown class labels in a partially labelled 
network. 

The topological features which have received the most attention in recent
years are communities or modules: groups of highly connected nodes within a
globally sparse network. The task of community detection is to identify these
groups. The motivations for conducting community detection are based on the
principles of assortative mixing: highly connected nodes share common
properties or attributes.

Although the problem of community detection may be well known and well
studied [HIBHH], it still lacks a formal definition and, in close relation to this,
an accepted method for performance assessment. Common approaches involve
either scoring algorithms by the communities they detect using the
Newman-Girvan modularity function or measuring their ability to recover known
partitions in the network.

Modularity measures the difference between the observed linkages within 
clusters and the expected connectivity of a random graph with the same degree 
distribution. However, there are a number of well documented problems with 
this function such as the resolution limit [5] and the more recently discovered
degeneracy problem [10]. The degeneracy issue is an important one which
provides motivation for this work. The modularity function has been found to
have an exponentially large number of high modularity solutions which are dis- 
tinctly different from each other. The implications are that modularity scores 
alone cannot indicate how good the identified clusters are. The problem is fur- 
ther compounded by hierarchical and overlapping communities. Furthermore, 
modularity scores are not comparable across networks. 
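
As a concrete illustration of the quantity discussed above, the following minimal sketch computes the Newman-Girvan modularity of an undirected graph partition in its usual form, Q = sum over communities k of [l_k / l - (d_k / 2l)^2], where l_k is the number of edges inside community k, l is the total number of edges and d_k is the total degree of the nodes in community k. The function name `modularity` and the toy graph are illustrative assumptions and are not taken from this work.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    l = len(edges)                  # total number of edges
    internal = defaultdict(int)     # l_k: edges with both ends in community k
    degree = defaultdict(int)       # d_k: total degree of community k
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[k] / l - (degree[k] / (2 * l)) ** 2 for k in degree)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_two = modularity(edges, two_groups)   # one community per triangle
q_one = modularity(edges, one_group)    # everything in a single community
```

Placing all nodes in one community gives a score of zero, while splitting the toy graph into its two triangles gives a positive score. The score alone says nothing about whether a partition is useful for a particular task, which is the concern raised in the text.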

The alternative for performance assessment is to use community detection to 
recover some known partition by either seeding a true partition in a generated 
network [11] or working with real networks for which a partition is well known; 
the networks of Sampson's monastery [12] and Zachary's karate club [13] are
two popular examples. However, in many real networks there are many
dimensions in which a "true" partition could lie; e.g. in a social network, partitions
could relate to family relations, business departments, social interest groups, 
ethnicity, or residential location. This implies that any known partition of a 
given network may not be the only valid one and therefore the ability to recover 
a particular partition which happens to be known seems to be an unfair test 
for an unsupervised method. The multi-dimensionality of human relations (and 
other complex systems represented by networks) has provided the motivation 
for algorithms to detect overlapping communities [7J[S1[T3], but validation of 
these models is difficult without complete knowledge of the data.

Instead of attempting to optimise a function known to be problematic, or to
solve an essentially supervised problem without any training, the work presented
here aims to use these structures as features of the network in a predictive
capacity. In this way, the clusters found are suitable
for a particular purpose, i.e. the prediction of node class labels. This addresses 
the validation problem, as the clusters found can be assessed according to their 
predictive performance in a given classification task. 

The proposed model is based on blockmodels which have been previously
used for community detection, such as in [15] where a blockmodelling approach
is formulated in such a way that modularity optimisation is a special case.
Blockmodels also allow the discovery of other types of (disassortative) features which
characterise the inter-cluster/community relations and therefore allow a wider
search space of topological features. Furthermore, blockmodels provide a
descriptive summary of the network, the interpretation of which is covered by
extensive literature [TBUlSj.

The proposed model is demonstrated on citation, blog and word networks
to identify features in the network topology which are then used to predict the
classification of unlabeled nodes in the network. These features are referred to as
"network positions" and a community is a special type of position (Figure 1). The
performance of the model on citation networks is found to rival other relational
learners, without the use of node attributes (word frequencies). Finally, the
model is shown to perform well on (approximately) bipartite networks.



2 Related Work

The work presented draws upon previous work from the fields of community
detection, blockmodelling and topic modelling.

In the past decade community detection has received a lot of attention from
researchers spanning a wide range of domains and has consequently produced a
wide range of approaches to the problem; for a comprehensive review of these
approaches see [H]. Although there is no commonly accepted formal definition
for community detection, a lot of approaches focus on the optimisation of the
Newman-Girvan modularity function. The Newman-Girvan modularity
function, Q, of a network partition made up of K disjoint communities is given
by:

$$Q = \sum_{k=1}^{K} \left[ \frac{l_k}{l} - \left( \frac{d_k}{2l} \right)^2 \right]$$

where $l_k$ is the number of edges in community k, $l$ is the number of edges in
the network and $d_k$ is the total degree of the nodes in community k. Commonly
used as it is, the modularity function has been found to suffer from a number
of issues such as a resolution limit and degeneracy of solutions [10]. Rather
than seek to find a globally optimum set of clusters, the proposed approach
addresses the degeneracy issue by attempting to find clusters which are able to
predict the class of a node.

In part due to the lack of a formal problem definition, community detection
lacks a commonly accepted evaluation procedure. To address this, [11] proposes
a benchmark graph generator and in later work [20] goes on to give a
comparative analysis of a selection of community detection algorithms. Recognising
that different community detection algorithms produce desirable outputs under
different circumstances, the work in [21] considered the problem of selecting
an appropriate community detection algorithm based on the properties of the
network and community structure. These methods, however, rely on there
being a single true partition, which in many cases is unreasonable. The work
presented here instead considers that there may be alternative partitions and
searches for the partition relevant to a particular context.

Blockmodels have been used for social and psychometric analysis for decades
[16][22][23]. The name refers to the "blocks" of zero and non-zero elements that
occur in the adjacency matrix when the rows and columns are reordered such
that nodes with similar interaction patterns are adjacent. The clusters of nodes
which make up these block patterns are known as network positions. The 
pattern of interactions between positions provides a summary of the interactions 
of the network. Bayesian formulations of blockmodelling, known as stochastic 
blockmodelling [23], were developed to create blockmodels by automatically 
assigning nodes to positions. Other extensions of the stochastic blockmodel
include the Mixed Membership Stochastic Blockmodel (MMSB) [14] which allows
nodes to belong to multiple positions and the non-parametric approaches of
the Infinite Relational Model [25] and the Latent Feature Relational Model
which automatically determine the number of positions. While these models
display statistically elegant solutions, they do so at a computational cost.
Some more recent attempts at scalable probabilistic models of social networks
include the Interaction Component Model for communities (ICMc) [27],
the Simple Social Network using Latent Dirichlet Allocation (SSN-LDA) [28]
and the Marginal Product Mixture Model (MPMM) [29]. These models assign
latent classes to each transaction rather than modelling all possible interactions
and so provide the greatest benefit when the networks are sparse. The
associated limitations of these approaches are that the receivers and senders
are generated from disjoint sets of positions, and/or the memberships of nodes
to positions are no longer explicitly inferred. The first case refers to the
SSN-LDA and MPMM models where receivers are generated from one set of network
positions and the senders from another set. This makes these models inappropriate
for the analysis of community structures, because a community is a position in
which nodes interact with others of the same position, which requires sender and
receiver positions to be the same. In the second case, applicable to ICMc and
MPMM, the effect is that rather than model the probability of an interaction
given the positions of the sender and receiver (i.e. $P(i = 1 \mid k_1, k_2)$, where $k_1$ and
$k_2$ are the positions of sender and receiver), these model the probability of a
sender-receiver pair of positions given an interaction (i.e. $P(k_1, k_2 \mid i = 1)$). An added
advantage of these approaches is that they are capable of modelling networks
with weighted links, where previous blockmodel approaches merely modelled
the presence or absence of a relation.

The main inspiration for this work comes from supervised Latent Dirichlet 
Allocation (sLDA) [20] which, as the name suggests, is a supervised extension of 
the topic modelling approach LDA [2] . Latent Dirichlet Allocation is a method 
for clustering a corpus of documents into mixed membership topics. The sLDA
model extends the LDA approach to identify topics which not only best
describe the document structure but also predict a known response
variable (i.e. a classification or regression target) associated with each
document.

The proposed model shares some similarity to graph based semi-supervised 
learning methods such as [3TH51] which aim to label the unlabeled nodes of 
a partially labelled graph. However, these methods assume assortativity, i.e. 
nodes of the same class tend to link to each other, whereas the method herein 
does not; this is demonstrated in Section 4.2.

Also related are the collective classification models such as the work of [35]
and [36]. Collective classification is a type of relational learning which uses
object attributes together with information about the links between objects to
predict class labels. The work presented here does not use content information
and relies only on the link structure of the network.

Figure 2: A graphical model representation of the supervised blockmodel for
sparse networks. Note that z is a pair of 1-of-K indicator variables $(z_s, z_r)$
drawn from a multinomial of length $K^2$.



3 Proposed Model 

Traditional blockmodelling is based on the idea that a set of latent features, 
referred to as network positions, are responsible for the interaction patterns 
between nodes in a network. These positions represent groups of nodes which 
have similar interaction patterns. Nodes of one position link to other nodes of 
another position with the same probability. If nodes of a particular position 
prefer to link to other nodes of the same position, then this position repre- 
sents a community. The early approaches to blockmodelling involved manually 
populating the positions using attribute values to group entities. More recent 
contributions [14][24][25] use probabilistic inference to determine the positions
and validate the models against node attributes. 

The proposed model lies somewhere in the middle of these approaches as 
it infers position membership based on the network structure, but uses class 
label information to coerce the position assignment such that the positions are 
relevant to the classification problem. The model allows multiple network
positions to make up a class (node label), or for a particular network position to be
incorporated in more than one class. The model also allows for nodes to take on
different positions for different interactions. The model is formulated by
combining the sLDA model of [37] with an adaptation of the ICMc model given in [55]
to provide a supervised model which is scalable to large but sparse networks.

The proposed Supervised Blockmodel for Sparse Networks (SBSN) assumes 
the following generative process: 

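1. For a given network draw a distribution over the possible network
position interactions $\pi \sim \mathrm{Dirichlet}(\alpha)$

2. For each position $k \in \{1, 2, \ldots, K\}$:

(a) Draw a distribution over nodes $\phi_k \sim \mathrm{Dirichlet}(\beta)$

3. For each interaction $i \in \{1, 2, \ldots, I\}$:

(a) Draw a position interaction pair $z_i = (z_{s_i}, z_{r_i}) \sim \mathrm{Multinomial}(\pi)$

(b) Draw a sender node $s_i \sim \mathrm{Multinomial}(\phi_{z_{s_i}})$

(c) Draw a receiver node $r_i \sim \mathrm{Multinomial}(\phi_{z_{r_i}})$

4. For each node $v \in \{1, 2, \ldots, N\}$:

(a) Draw a class label $y_v \sim \mathrm{Softmax}(\eta, \bar{z}_v)$ where

$$\bar{z}_v = \frac{1}{n_v} \sum_{i=1}^{I} \left( z_{s_i} \delta_{s_i,v} + z_{r_i} \delta_{r_i,v} \right),$$

$z_{s_i}$ and $z_{r_i}$ are the indicator vectors of length K describing the
network position of the sender and receiver nodes in interaction i, $n_v$ is the
number of times node v is involved in an interaction and $\delta_{\cdot,\cdot}$
is the Kronecker delta. $\bar{z}_v$ therefore represents the empirical
behavior class frequencies for node v. The softmax function provides the
following distribution:

$$p(y_v = c \mid \bar{z}_v, \eta) = \frac{\exp(\eta_c^T \bar{z}_v)}{\sum_{c'=1}^{C} \exp(\eta_{c'}^T \bar{z}_v)}$$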
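
To make the generative process above more concrete, the following is a minimal sketch of a sampler written with the Python standard library only. It is an illustration under stated assumptions rather than the implementation used in this work: the names `sample_sbsn`, `dirichlet` and `softmax` are hypothetical, the class weights `eta` are supplied by the caller, the K*K position pairs are assumed to be stored in row-major order, the position frequencies of a node are normalised by the number of interactions it takes part in, and nodes with no interactions fall back to uniform frequencies.

```python
import math
import random

def dirichlet(rng, concentration, size):
    """Symmetric Dirichlet sample via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_sbsn(N, K, I, C, alpha, beta, eta, seed=0):
    """Draw interactions and node labels from the generative process (sketch).

    eta: C x K nested list of class weights (assumed given).
    Returns (interactions, labels).
    """
    rng = random.Random(seed)
    # Step 1: distribution over the K*K (sender, receiver) position pairs.
    pi = dirichlet(rng, alpha, K * K)
    # Step 2: one distribution over the N nodes for each position.
    phi = [dirichlet(rng, beta, N) for _ in range(K)]
    # Step 3: draw a position pair, then a sender and a receiver node.
    interactions = []
    counts = [[0.0] * K for _ in range(N)]  # position usage per node
    for _ in range(I):
        pair = rng.choices(range(K * K), weights=pi)[0]
        zs, zr = divmod(pair, K)            # row-major layout (assumption)
        s = rng.choices(range(N), weights=phi[zs])[0]
        r = rng.choices(range(N), weights=phi[zr])[0]
        interactions.append((s, r))
        counts[s][zs] += 1
        counts[r][zr] += 1
    # Step 4: class label from the softmax of the position frequencies.
    labels = []
    for v in range(N):
        n_v = sum(counts[v])
        # Uniform fallback for isolated nodes is an assumption of this sketch.
        zbar = [c / n_v for c in counts[v]] if n_v > 0 else [1.0 / K] * K
        scores = [sum(w * z for w, z in zip(eta[c], zbar)) for c in range(C)]
        labels.append(rng.choices(range(C), weights=softmax(scores))[0])
    return interactions, labels

# Hypothetical usage: 20 nodes, 3 positions, 100 interactions, 2 classes.
interactions, labels = sample_sbsn(
    N=20, K=3, I=100, C=2, alpha=2.0, beta=0.5,
    eta=[[2.0, 0.0, -2.0], [-2.0, 0.0, 2.0]],
)
```

Because each interaction draws a single position pair rather than a value for every possible pair of nodes, the cost of sampling grows with the number of interactions and not with the square of the number of nodes.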

Footnote 1: Conceptually, it is best to think of $\pi$ as a matrix with dimensions $K \times K$ such that each
element $\pi_{k_1,k_2}$ represents the probability of observing a link from $k_1$ to $k_2$.
Footnote 2: The hyperparameters $\alpha$ and $\beta$ are fixed.

Inference of the network positions, $z_s$ and $z_r$, and estimation of the model
parameters $(\pi, \phi)$ are undertaken using an approximation of the
Expectation-Maximisation (EM) algorithm, variational Bayesian EM (VBEM) [39]. To start
with, the intractable likelihood is lower bounded using an arbitrary distribution
q and Jensen's inequality to give the variational free energy $\mathcal{F}(q, \Theta)$:

$$
\begin{aligned}
\log p(s, r, y \mid \Theta) &= \log \int p(z, \phi, \pi, s, r, y \mid \Theta) \, d\pi \, d\phi \, dz \\
&\geq E_q[\log p(z, \phi, \pi, s, r, y \mid \Theta)] - E_q[\log q(z, \phi, \pi)] \\
&= \mathcal{F}(q, \Theta). \qquad (1)
\end{aligned}
$$



In regular EM, $\mathcal{F}(q, \Theta)$ is optimised iteratively with respect to the q distribution (E-step)
and the model parameters (M-step). In the E-step the lower bound
is saturated by setting q equal to the posterior distribution over latent variables
(positions). However, in this case computing the posterior is intractable, so the
posterior is approximated with the fully factored distribution:

$$q(z, \phi, \pi) = \mathrm{Dir}(\pi \mid \omega) \prod_{i=1}^{I} q(z_i \mid \lambda_i) \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \zeta_k),$$

where $\lambda$, $\zeta$ and $\omega$ are the variational parameters of the approximated posterior.
For brevity the full derivation of the variational E and M steps is omitted and
only the update functions are included here. The update steps were derived by
closely following the procedure outlined in [37] and use the same
approximations and identities. The interested reader may wish to refer to the appendix
in [13] for a full derivation of the variational EM algorithm for the MMSB or [SD]
for further details of Supervised LDA.

3.1 Variational E-Step

Inference of the network positions is achieved by optimising the variational free
energy $\mathcal{F}(q, \Theta)$ in (1) with respect to the approximating q distribution.



where $\Psi(\cdot)$ is the derivative of the log Gamma function, $n_v$ is the number of times
node v is involved in an interaction and $\lambda_v$ is a K-length vector representing
the marginal probability of sender or receiver positions, i.e.



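$$\lambda_{s_i} = \left( \sum_{k_2} \lambda_{i,1,k_2}, \; \sum_{k_2} \lambda_{i,2,k_2}, \; \ldots, \; \sum_{k_2} \lambda_{i,K,k_2} \right), \qquad \lambda_{r_i} = \left( \sum_{k_1} \lambda_{i,k_1,1}, \; \sum_{k_1} \lambda_{i,k_1,2}, \; \ldots, \; \sum_{k_1} \lambda_{i,k_1,K} \right),$$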



and $h^T \lambda_v$ represents an approximation to the expectation under the q
distribution of the normalising function of the softmax distribution for a node v. This
$h^T \lambda_v$ is given by:

/.,.=^exp(^^) n EV'=°-P(^) • 



3.2 Variational M-Step 



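The variational M-step updates the posterior distribution over the model
parameters which maximise the variational free energy $\mathcal{F}(q, \Theta)$ for the current
distribution over the latent variables z.

$$\zeta_{k_1,v} = \sum_{i=1}^{I} \left( \delta_{s_i,v} \sum_{k_2} \lambda_{i,k_1,k_2} + \delta_{r_i,v} \sum_{k_2} \lambda_{i,k_2,k_1} \right) + \beta,$$

$$\omega_{k_1,k_2} = \sum_{i=1}^{I} \lambda_{i,k_1,k_2} + \alpha,$$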
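
The two closed-form updates of this step can be sketched in a few lines of Python. The sketch reads the updates as: the parameter for position k and node v collects, over all interactions, the sender marginal of the variational position probabilities when v is the sender and the receiver marginal when v is the receiver, plus the hyperparameter beta; the parameter for a position pair collects the variational probabilities of that pair over all interactions, plus alpha. The function name `m_step` and the nested-list data layout are assumptions made for illustration.

```python
def m_step(lam, senders, receivers, N, alpha, beta):
    """Closed-form variational M-step updates (illustrative sketch).

    lam: list of I matrices (K x K); lam[i][k1][k2] is the variational
         probability that interaction i has sender position k1 and
         receiver position k2.
    senders, receivers: node indices of each interaction.
    Returns (zeta, omega) where zeta[k][v] parameterises the node
    distribution of position k and omega[k1][k2] the pair distribution.
    """
    K = len(lam[0])
    zeta = [[beta] * N for _ in range(K)]
    omega = [[alpha] * K for _ in range(K)]
    for lam_i, s, r in zip(lam, senders, receivers):
        for k1 in range(K):
            for k2 in range(K):
                omega[k1][k2] += lam_i[k1][k2]
                zeta[k1][s] += lam_i[k1][k2]  # accumulates the sender marginal
                zeta[k2][r] += lam_i[k1][k2]  # accumulates the receiver marginal
    return zeta, omega

# Hypothetical example: 3 nodes, 2 positions, interactions 0 -> 1 and 1 -> 2.
lam = [
    [[0.7, 0.1], [0.1, 0.1]],
    [[0.0, 0.0], [0.0, 1.0]],
]
zeta, omega = m_step(lam, senders=[0, 1], receivers=[1, 2], N=3,
                     alpha=2.0, beta=0.5)
```

Each interaction contributes a total mass of one to `omega` and of two to `zeta` (one for the sender, one for the receiver), so the cost of the update is linear in the number of observed interactions.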



$\eta$ is found using conjugate gradient to optimise the free energy terms of (1)
corresponding to $\eta$:

V c i:vG \ k \ ^ 

{si,ri} 

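where $\bar{\lambda}_v = \frac{1}{n_v} \sum_{i=1}^{I} \left( \lambda_{s_i} \delta_{s_i,v} + \lambda_{r_i} \delta_{r_i,v} \right)$. Conjugate gradient requires the following
derivatives: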



E 



^ ;^A.,,exp(^) 

■ v£{si,ri} J2l ^v-l 



3.3 Prediction

Predicting unlabeled nodes requires inference of the network positions given the
rest of the network. As the class label is unknown the inference is performed as
in Section 3.1 but without the terms involving $\eta$. Classification of a test node
is given by:

$$y^* = \arg\max_{y \in \{1, \ldots, C\}} E_q[\eta_y^T \bar{z}_v] = \arg\max_{y \in \{1, \ldots, C\}} \eta_y^T \bar{\lambda}_v$$



4 Experimental Results

This section describes the experiments performed using the SBSN model. In
all experiments the parameters $\alpha$ and $\beta$ were fixed at 2.0 and ^ respectively;
these values were found in general to give good classification performance. The
$\lambda$ and $\eta$ parameters were initialised randomly. In each classification experiment
the SBSN model was fit to the whole network by iterating the variational E and
M steps described in the previous section until convergence of the free energy.
Only a randomly selected proportion of the network nodes in each experiment
were labelled (training set); the remaining nodes were used as a test set.

4.1 Data

Four publicly available datasets were used to demonstrate classification using
the model. The first two are the citation networks Cora and Citeseer
from [40], which consist of 2708 and 3312 nodes representing papers and 5429
and 4732 edges representing citations respectively. The Cora dataset contains
papers from 7 categories and Citeseer has 6. The third is the AGBlog dataset,
the largest connected component from the graph of the political blog dataset
found in [41]. This network has 1222 nodes labeled "Liberal" or
"Conservative" and 19021 edges connecting them. The fourth dataset is a word network
from [42] comprised of the 112 most frequently occurring nouns and adjectives
in the novel David Copperfield by Charles Dickens. This network contains a
link between words whenever they appear adjacent to each other. It is
approximately bipartite and is used to demonstrate the SBSN model's ability to extract
disassortative features.

4.2 Classification

Experiments were run to investigate the effect of various factors on the
classification performance: the maximum number of positions (K), initialisation of $\phi$, and
the proportion of the network that was labelled (training set size). Performance
is measured according to the Macro-averaged F1 measure given by:

$$F_1 = \frac{2TP}{2TP + FN + FP},$$

where TP, FN, and FP correspond to the true positive, false negative and false
positive rates respectively. The F1 measure represents the harmonic mean of
the precision and recall values. For the multi-class problems the macro-average
is used, i.e. the F1 score is calculated for each class and then averaged. Each
experiment was run 25 times and the performance scores reported reflect the
average over these runs.
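
For reference, the macro-averaged F1 measure described above can be computed as in the following sketch: a one-vs-rest F1 = 2TP / (2TP + FN + FP) is calculated for each class from raw counts and the per-class scores are then averaged with equal weight. The function name `macro_f1` is hypothetical, and averaging over every class that appears in either the true or the predicted labels is an assumption of this sketch.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class 2TP / (2TP + FN + FP), then the mean."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Every class considered occurs at least once, so the denominator
        # is always positive.
        scores.append(2 * tp / (2 * tp + fn + fp))
    return sum(scores) / len(scores)

# Hypothetical example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, small classes influence the score as much as large ones, which matters for datasets with unbalanced categories.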

Figure 3 shows the classification performance on the Citeseer and Cora
datasets for different values of the model parameter K, the maximum
number of positions. As a baseline, the performance of the unsupervised version
of the model is also shown (i.e. the terms containing node labels were omitted
in the update described in Section 3.1). Each data point represents the mean
of 25 tests and the error bars represent one standard deviation. For each run,
two thirds of the network were randomly selected for training and the
remaining nodes used for testing. It can be seen that the performance improves as K
increases; however, on closer inspection it was found that the membership
of many of these positions tended to zero (due to the regularisation effect of
the hyperparameters). It is curious then to see that increasing K improves the
performance. One possible explanation is that a greater number of initial
positions allows greater freedom for position assignments to change.

Table 1 compares the prediction performances on the Citeseer and Cora
datasets with the published results of the best collective classifiers in [35] and
[36]. Collective classification is a type of relational learning which uses object
attributes together with information about the links between objects to predict 
class labels. The results of the SBSN model demonstrate comparable prediction 
performance rather than consistently outperforming the collective classification 
models. The interesting result here is that while the collective classification 
models use attribute information (i.e. the word frequencies of the documents), 
the SBSN model does not. This shows that, for these datasets, the link structure 
alone is indicative of the class and suggests that beyond the topology the content 
information is redundant. 

Figure 3: Macro-F1 scores for the Citeseer (left) and Cora (right) datasets as the
maximum number of positions (K) is varied. The "Supervised" data points show
the performance of the SBSN model compared against the baseline
"Unsupervised" version which is trained without using class labels to infer the positions.

Figures 4-6 show the prediction performance on the citation and blog
networks as training set size varies (from 5% to 80%). Two semi-supervised
methods, weighted vote Relational Neighbour classifier (wvRN) [31] and Multi-Rank
Walk (MRW) [43], were also run on the same datasets (using code downloaded from
http://www.cs.cmu.edu/~frank/code/asonam2010-code.zip) to provide a
comparison. In comparing the SBSN model with the semi-supervised learners, it was
found that their performances were surprisingly low (Figure 4). This was found to be
due to the networks being directed, whereas these algorithms were not
designed for directed networks. The experiments were re-run using undirected
versions of these networks which were constructed by repeating each link in the
opposite direction (Figure 5). It can be seen that the performance of the SBSN
model (labelled SBSN_flat) degrades substantially as the proportion of labelled
nodes decreases. This is because of the large search space of possible position
assignments relative to the amount of data. To compensate for this, instead of
running a single M-step to initialise the distribution over $\pi$, the distribution was
initialised to favour assortative positions, i.e.:



K 



a a + 

This initialisation takes into account the assumption made by the
semi-supervised algorithms that nodes link to other nodes of the same type
(assortativity). This results in the scores labelled SBSN_comm.

Figure 7 shows the performance of the SBSN on the classification of adjectives
and nouns in the Words dataset. The graph shows that the model performs
well with a maximum number of positions (K) between 5 and 20. For larger
values of K the performance degrades due to overfitting of the training data.
It can be seen that there is high variance in the performance of the SBSN
model. Closer inspection of the results revealed a bimodal distribution for the
prediction performance which was due to occasions where the model failed to
fit the training data, i.e. the classification performance on the training set was
poor (a similar situation was observed on the AGBlog results shown in Figure
6). To allow for this, Figure 7 also shows the performance of models which fit
the training set well, based on a classification performance on the training set
data of at least 0.9. Table 2 gives the performance scores of the SBSN model
in comparison to the semi-supervised models, showing that the SBSN model
is able to deal with disassortative classes while these semi-supervised models
cannot.

Figure 4: Macro-F1 scores for the Citeseer (left) and Cora (right) datasets for
different training set sizes. A comparison is made of the performance of naive
initialisation of $\pi$ (flat) and the performance of a "community"-based initialisation
(comm). Also included are two semi-supervised algorithms, wvRN and MRW.

Figure 5: Macro-F1 scores for the Citeseer (left) and Cora (right) datasets for
different training set sizes. Similar to Figure 4 but here the algorithms wvRN
and MRW are run on undirected versions of the networks.

Figure 6: Macro-F1 scores for the AGBlog dataset for different training set sizes.
A comparison is made of the performance of naive initialisation of $\pi$ (flat) and
the performance of "community"-based initialisation (comm).

Figure 7: Macro-F1 scores for the Words dataset as the maximum number of
positions (K) is varied. Here a similar comparison is drawn between
"Supervised" and "Unsupervised" versions. It can be seen that there is high variance
in the performance of the SBSN model due to occasions where the model has
failed to fit the data. For comparison the F1 scores are shown for all the models
which achieved an F1 score > 0.9 on the training set.

Table 1: Classification performance - SBSN vs. Relational Learners

               Citeseer network       Cora network
Method         F1       Accuracy      F1       Accuracy
MF [35]        0.6291   0.7267        0.7970   0.8261
LBP [35]       0.6264   0.7294        0.8248   0.8449
Stacked [36]   -        0.598         -        0.739
SBSN           0.6705   0.7029        0.8420   0.8519

Table 2: Classification performance - SBSN vs. Semi-supervised

               Words network
Method         F1 measure   Accuracy
wvRN [31]      0.4216       0.5289
MRW [43]       0.4411       0.4614
SBSN           0.7462       0.7484

4.3 Block Model Analysis 

A property of the SBSN model which sets it apart from the methods of the 
previous section, is that it identifies the topological features, in the form of 
network positions, upon which the classification is based. Using the variable 
distributions (specifically $\pi$ and $\phi$) of the fitted model, qualitative analysis can
be undertaken to understand the network positions and the pattern of links 
between them which are behind the classification decision. 

To understand the links between positions, a block model matrix is
constructed by rearranging the elements of $\pi$ into a matrix such that the rows
and columns refer to the positions of the interaction source and target
respectively. Figure 8 shows the blockmodel image matrix for the discovered positions
in the Cora dataset. This matrix summarises the interaction patterns in the 
network and shows the relative probability of an interaction between different 
network positions (darker shaded blocks indicate higher probability). The di- 
agonal blocks of the image matrix indicate the presence of assortative groups 
(communities). 
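
The construction of the block model matrix described above can be sketched as follows. The sketch assumes that the fitted distribution over position pairs is stored as a flat list of length K*K in row-major order (source position first); this storage layout, the function names `block_matrix` and `assortative_positions`, and the simple threshold used to flag assortative positions are all illustrative assumptions rather than part of the SBSN model.

```python
def block_matrix(pi, K):
    """Rearrange a length K*K vector over position pairs into a K x K
    matrix with source positions as rows and target positions as columns."""
    return [list(pi[k * K:(k + 1) * K]) for k in range(K)]

def assortative_positions(B, threshold=0.5):
    """Positions whose outgoing probability mass lies mostly on the diagonal.

    The threshold is a hypothetical heuristic used only for illustration.
    """
    found = []
    for k, row in enumerate(B):
        total = sum(row)
        if total > 0 and row[k] / total > threshold:
            found.append(k)
    return found

# Hypothetical fitted distributions for K = 2.
community_like = block_matrix([0.40, 0.05, 0.05, 0.50], 2)
bipartite_like = block_matrix([0.05, 0.45, 0.45, 0.05], 2)
```

In the first toy matrix both positions are flagged as assortative, corresponding to dark diagonal blocks in the image matrix; in the second none are, which is the pattern expected for an approximately bipartite network.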

Visualisation of the $\phi$ distribution in a similar way shows which nodes (rows) are
members of each of the positions (columns); darker shades indicate a higher level
of membership. From Figure 8 it can be seen that positions 6 and 7 appear not
to favour interactions with any position. Cross-referencing this with the image
matrix of the $\phi$ distribution in Figure 9 shows that there is only a low level
of membership to these positions. Ordering the rows of the $\phi$ image matrix
such that nodes of the same class are adjacent reveals the positions that are
indicative of each of the classes; for the Cora dataset this is highlighted in the
right hand side of Figure 9 where the image is segmented and annotated with
the subject labels.

Figures 10a and 10c show the $\pi$ and $\phi$ image matrices respectively for the
Words dataset. The off-diagonal blocks of the blockmodel image show the
disassortative nature of the positions. This image can be used to construct a
summary network to describe the interactions between positions (Figure 10b).
In this case the network summary forms a chain. Cross-referencing with the $\phi$
distribution indicates that the majority of interactions are from the adjectives
of position 1 to the nouns of position 4. This reflects the fact that in the English
language adjectives usually precede nouns.

Figure 8: Block model image matrix summarising the interaction patterns of the
Cora network. The darker the block the higher the probability of interaction. It
can be seen from the diagonal blocks that the positions found favour interacting
with others of the same position.



Figure 9: Image matrix of the $\phi$ distribution. Each row describes the membership
of a node in each of the network positions (columns). The nodes are ordered by class. The
image matrix on the right is the one on the left annotated with class labels. It
can be seen that each class usually corresponds to a particular network position
(e.g. Neural Networks are usually position 4).

Figure 10: Interpretation of the SBSN model applied to the Words dataset.
The blockmodel image matrix (a) describes the probability of any pair of roles
interacting, i.e. this is a visual representation of the multinomial parameter $\pi$.
This can be interpreted as a summary of the network interactions (b). The image
matrix of the position memberships ($\phi$) (c) shows which nodes (rows) belong
to each of the network positions (columns). This shows that the adjectives are
usually position 1 and 3 while the nouns are 2 and 4.



5 Conclusions and Future Work

This work has presented the Supervised Blockmodel for Sparse Networks (SBSN),
a model for jointly modelling relational and class label information. The
proposed model has been demonstrated to perform well in predictive classification
tasks on real world benchmark datasets. This model differs from other
relational and semi-supervised learning models because, in addition to classification,
it identifies topological features to explain the classification decision, where these
features relate to the network positions from blockmodelling. Initialisation using
prior knowledge of the data structure and the rejection of poorly fitting models
have been considered. On real world networks, the classification performance of
the SBSN has been shown to be comparable to that of relational learners which
use additional node attribute information. The SBSN has been found to
perform similarly to baseline semi-supervised learners, although prior
assumptions (the same used by the semi-supervised algorithms) are required to obtain
good performance when the proportion of labelled data is small. In addition,
the SBSN model can perform well with directed networks and disassortative
classes.

Although not investigated here, the current formulation of SBSN is amenable
to weighted networks. It is left to future work to investigate multiple
separate classification tasks on a single network to explore how the predictive
network positions change with the classification task.